{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building Comprehensive AWS Index\n",
    "\n",
    "The AWS index files are not comprehensive, so if we want to match organizations to specific tax returns or have a comprehensive view of the space, we have to do this more manually.\n",
    "\n",
    "Before this notebook I have 1) downloaded the current AWS index files from 2015-2019 from https://s3.amazonaws.com/irs-form-990/index_20[XX].csv and 2) used the script from here: https://appliednonprofitresearch.com/posts/2020/06/skip-the-irs-990-efile-indices/ to retrieve a comprehensive list of files. If this is public I will be posting my lightly edited version of that script as well.\n",
    "\n",
    "This notebook:\n",
    "1. Chunks the comprehensive file list by year\n",
    "2. Retrieves basic info from 990s tht are missing in the AWS index\n",
    "3. Adds all the info into a new comprehensive index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import re\n",
    "import time\n",
    "\n",
    "from irsx.xmlrunner import XMLRunner"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chunking Comprehensive File List\n",
    "\n",
    "This just makes everything a bit more manageable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Change based on when you want to retrieve\n",
    "BEG_YR = 2015\n",
    "END_YR = 2019\n",
    "yr_lst = list( range( BEG_YR, END_YR + 1 ) )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First doing some managing of the result of the script by separating it into years\n",
    "# complete_file_list.csv is simply a one column (called \"file_name\") list.\n",
    "# This notebook assumes and saves all files in the local directory\n",
    "file_list = pd.read_csv( \"complete_file_list.csv\" )\n",
    "\n",
    "for yr in yr_list:\n",
    "    save_filename = \"file_list_\" + str( yr ) + \".csv\"\n",
    "    file_list[ file_list['file_name'].str.startswith( str( yr ) ) ].to_csv( save_filename, index=False )\n",
    "\n",
    "file_list = None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Retrieve 990 Information\n",
    "\n",
    "I'm only going to retrieve the interpretable information that already exists in the AWS index files, that is, EIN, Name, and Form. As the above linked blog post points out, TAX PERIOD is unreliable. My guess is that if you're using this, you don't need to know tax period.\n",
    "\n",
    "Using IRSx for retrieval.\n",
    "\n",
    "This takes a very long time (on the order of days). Luckily I had other things I was doing this week, but hopefully will never have to do 6 years at a time again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_time = time.time()\n",
    "update_period = 5000 # How often you want to be updated about progress.\n",
    "\n",
    "for yr in yr_lst:\n",
    "    \n",
    "    str_yr = str( yr )\n",
    "    print( \"Starting with \" + str_yr + \" at \" + str( round( time.time() - start_time, 1 ) ) + \" seconds.\" )\n",
    "    \n",
    "    # Get comprehensive list of files for the year and create object ID from the file\n",
    "    file_lst_df = pd.read_csv( \"file_list_\" + str_yr + \".csv\" )\n",
    "    file_lst_df['OBJECT_ID'] = file_lst_df['file_name'].str[:18].astype( int )\n",
    "    \n",
    "    # Get AWS index to focus only on files that are not covered\n",
    "    aws_ind = pd.read_csv( \"index_\" + str_yr + \".csv\" )\n",
    "    file_lst_df = file_lst_df[~file_lst_df['OBJECT_ID'].isin( aws_ind['OBJECT_ID'] )]\n",
    "    \n",
    "    print( str_yr + \" list has \" + str( len( file_lst_df ) ) + \" missing.\" )\n",
    "    \n",
    "    # Iterate through, adding information to lists\n",
    "    xml_runner = XMLRunner()\n",
    "    yr_oid = []\n",
    "    yr_ein = []\n",
    "    yr_name = []\n",
    "    yr_form = []\n",
    "    counter = 0\n",
    "    for index, row in file_lst_df.iterrows():\n",
    "        \n",
    "        # Retrieve 990\n",
    "        try:\n",
    "            header_dat = xml_runner.run_sked( row['OBJECT_ID'], 'ReturnHeader990x').get_result()\n",
    "        except:\n",
    "            header_dat = None\n",
    "            \n",
    "        # Read in applicable data\n",
    "        try:\n",
    "            file_ein = header_dat[0]['schedule_parts']['returnheader990x_part_i']['ein']\n",
    "        except:\n",
    "            file_ein = 0\n",
    "        try:\n",
    "            file_name1 = header_dat[0]['schedule_parts']['returnheader990x_part_i']['BsnssNm_BsnssNmLn1Txt']\n",
    "        except:\n",
    "            file_name1 = ''\n",
    "        try:\n",
    "            file_name2 = header_dat[0]['schedule_parts']['returnheader990x_part_i']['BsnssNm_BsnssNmLn2Txt']\n",
    "        except:\n",
    "            file_name2 = ''\n",
    "        file_name_c = ( file_name1 + ' ' + file_name2 ).strip().lower()\n",
    "        try:\n",
    "            file_form = header_dat[0]['schedule_parts']['returnheader990x_part_i']['RtrnHdr_RtrnCd']\n",
    "        except:\n",
    "            file_form = ''\n",
    "\n",
    "        yr_oid.append( str( row['OBJECT_ID'] ) )\n",
    "        yr_ein.append( file_ein )\n",
    "        yr_name.append( file_name_c )\n",
    "        yr_form.append( file_form )\n",
    "        \n",
    "        # Update me about progress - one thing I did not do but might want to try is adding a pause in here\n",
    "        # the retrieval is always faster after a break.\n",
    "        if ( counter % update_period ) == 0:\n",
    "            print( str_yr + \" counter at: \" + str( counter ) + \" out of \" + str( len( file_lst_df ) ) +\\\n",
    "                   \" (\" + str( round( counter / len( file_lst_df ), 2 ) ) + \")\" +\\\n",
    "                   \" at \" + str( round( time.time() - start_time, 1 ) ) + \" seconds.\" )\n",
    "        counter += 1\n",
    "    \n",
    "    # Save file as a csv from a pandas dataframe\n",
    "    pd.DataFrame( {'OBJECT_ID': yr_oid, 'EIN': yr_ein, 'TAXPAYER_NAME': yr_name, 'RETURN_TYPE': yr_form} ).\\\n",
    "        to_csv( str_yr + \"_addit_index.csv\" )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build Comprehensive Index\n",
    "\n",
    "This step could have been easily combined into the iteration above, but it's nice to be a bit chunked."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2015 AWS index file missing 31.16% of filings.\n",
      "2016 AWS index file missing 28.41% of filings.\n",
      "2017 AWS index file missing 15.56% of filings.\n",
      "2018 AWS index file missing 16.2% of filings.\n",
      "2019 AWS index file missing 26.08% of filings.\n"
     ]
    }
   ],
   "source": [
    "for yr in yr_lst:\n",
    "    aws_ind = pd.read_csv( \"index_\" + str( yr ) + \".csv\" )\n",
    "    aws_ind['INDEX_SRC'] = \"AWS INDEX\"\n",
    "    new_ind = pd.read_csv( str( yr ) + \"_addit_index.csv\", index_col=0 )\n",
    "    new_ind['INDEX_SRC'] = \"AWS FILE DIR\"\n",
    "    new_ind['TAXPAYER_NAME'] = new_ind['TAXPAYER_NAME'].str.upper()\n",
    "    aws_ind.append( new_ind ).to_csv( \"all_file_index_\" + str( yr ) + \".csv\", index=False )\n",
    "    print( str( yr ) + \" AWS index file missing \" +\\\n",
    "           str( round( 100 * len( new_ind ) / ( len( new_ind ) + len( aws_ind ) ), 2 ) ) +\\\n",
    "           \"% of filings.\" )"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
